Goto

Collaborating Authors

 selection event


Post-ADC Inference: Valid Inference After Active Data Collection

arXiv.org Machine Learning

The validity of statistical inference depends critically on how data are collected. When data gathered through active data collection (ADC) are reused for a post-hoc inferential task, conventional inference can fail because the sampling is adaptively biased toward regions favored by the collection strategy. This issue is especially pronounced in black-box optimization, where sequential model-based optimization (SMBO) methods such as the tree-structured Parzen estimator (TPE) and Gaussian process upper confidence bound (GP-UCB) preferentially concentrate evaluations in promising regions. We study statistical inference on actively collected data when the inferential target is constructed in a data-dependent manner after data collection. To enable valid inference in this setting, we propose post-ADC inference, a framework that accounts for the biases arising from both the active data collection process and the subsequent data-driven target construction. Our method builds on selective inference and provides valid $p$-values and confidence intervals that correct for both sources of bias. The framework applies to a broad class of ADC processes by imposing only assumptions on the observation noise, without requiring any assumptions on the underlying black-box function or the surrogate model used by the SMBO algorithm. Empirical results also show that post-ADC inference provides valid inference for data collected by GP-UCB and TPE.




$ฯ•$-test: Global Feature Selection and Inference for Shapley Additive Explanations

arXiv.org Machine Learning

We propose $ฯ•$-test, a global feature-selection and significance procedure for black-box predictors that combines Shapley attributions with selective inference. Given a trained model and an evaluation dataset, $ฯ•$-test performs SHAP-guided screening and fits a linear surrogate on the screened features via a selection rule with a tractable selective-inference form. For each retained feature, it outputs a Shapley-based global score, a surrogate coefficient, and post-selection $p$-values and confidence intervals in a global feature-importance table. Experiments on real tabular regression tasks with tree-based and neural backbones suggest that $ฯ•$-test can retain much of the predictive ability of the original model while using only a few features and producing feature sets that remain fairly stable across resamples and backbone classes. In these settings, $ฯ•$-test acts as a practical global explanation layer linking Shapley-based importance summaries with classical statistical inference.


Spacing Test for Fused Lasso

arXiv.org Artificial Intelligence

Detecting changepoints in a one-dimensional signal is a classical yet fundamental problem. The fused lasso provides an elegant convex formulation that produces a stepwise estimate of the mean, but quantifying the uncertainty of the detected changepoints remains difficult. Post-selection inference (PSI) offers a principled way to compute valid $p$-values after a data-driven selection, but its application to the fused lasso has been considered computationally cumbersome, requiring the tracking of many ``hit'' and ``leave'' events along the regularization path. In this paper, we show that the one-dimensional fused lasso has a surprisingly simple geometry: each changepoint enters in a strictly one-sided fashion, and there are no leave events. This structure implies that the so-called \emph{conservative spacing test} of Tibshirani et al.\ (2016), previously regarded as an approximation, is in fact \emph{exact}. The truncation region in the selective law reduces to a single lower bound given by the next knot on the LARS path. As a result, the exact selective $p$-value takes a closed form identical to the simple spacing statistic used in the LARS/lasso setting, with no additional computation. This finding establishes one of the rare cases in which an exact PSI procedure for the generalized lasso admits a closed-form pivot. We further validate the result by simulations and real data, confirming both exact calibration and high power. Keywords: fused lasso; changepoint detection; post-selection inference; spacing test; monotone LASSO


Evaluating the statistical significance of biclusters

Neural Information Processing Systems

Biclustering (also known as submatrix localization) is a problem of high practical relevance in exploratory analysis of high-dimensional data. We develop a framework for performing statistical inference on biclusters found by score-based algorithms. Since the bicluster was selected in a data dependent manner by a biclustering or localization algorithm, this is a form of selective inference . Our framework gives exact (non-asymptotic) confidence intervals and p-values for the significance of the selected biclusters.



Flexible Selective Inference with Flow-based Transport Maps

arXiv.org Machine Learning

Data-carving methods perform selective inference by conditioning the distribution of data on the observed selection event. However, existing data-carving approaches typically require an analytically tractable characterization of the selection event. This paper introduces a new method that leverages tools from flow-based generative modeling to approximate a potentially complex conditional distribution, even when the underlying selection event lacks an analytical description -- take, for example, the data-adaptive tuning of model parameters. The key idea is to learn a transport map that pushes forward a simple reference distribution to the conditional distribution given selection. This map is efficiently learned via a normalizing flow, without imposing any further restrictions on the nature of the selection event. Through extensive numerical experiments on both simulated and real data, we demonstrate that this method enables flexible selective inference by providing: (i) valid p-values and confidence sets for adaptively selected hypotheses and parameters, (ii) a closed-form expression for the conditional density function, enabling likelihood-based and quantile-based inference, and (iii) adjustments for intractable selection steps that can be easily integrated with existing methods designed to account for the tractable steps in a selection procedure involving multiple steps.


Statistically Significant $k$NNAD by Selective Inference

arXiv.org Machine Learning

In this paper, we investigate the problem of unsupervised anomaly detection using the k-Nearest Neighbor method. The k-Nearest Neighbor Anomaly Detection (kNNAD) is a simple yet effective approach for identifying anomalies across various domains and fields. A critical challenge in anomaly detection, including kNNAD, is appropriately quantifying the reliability of detected anomalies. To address this, we formulate kNNAD as a statistical hypothesis test and quantify the probability of false detection using $p$-values. The main technical challenge lies in performing both anomaly detection and statistical testing on the same data, which hinders correct $p$-value calculation within the conventional statistical testing framework. To resolve this issue, we introduce a statistical hypothesis testing framework called Selective Inference (SI) and propose a method named Statistically Significant NNAD (Stat-kNNAD). By leveraging SI, the Stat-kNNAD method ensures that detected anomalies are statistically significant with theoretical guarantees. The proposed Stat-kNNAD method is applicable to anomaly detection in both the original feature space and latent feature spaces derived from deep learning models. Through numerical experiments on synthetic data and applications to industrial product anomaly detection, we demonstrate the validity and effectiveness of the Stat-kNNAD method.


Exact Post Model Selection Inference for Marginal Screening

Neural Information Processing Systems

We develop a framework for post model selection inference, via marginal screening, in linear regression. At the core of this framework is a result that characterizes the exact distribution of linear functions of the response y, conditional on the model being selected ("condition on selection" framework). This allows us to construct valid confidence intervals and hypothesis tests for regression coefficients that account for the selection procedure. In contrast to recent work in high-dimensional statistics, our results are exact (non-asymptotic) and require no eigenvalue-like assumptions on the design matrix X. Furthermore, the computational cost of marginal regression, constructing confidence intervals and hypothesis testing is negligible compared to the cost of linear regression, thus making our methods particularly suitable for extremely large datasets. Although we focus on marginal screening to illustrate the applicability of the condition on selection framework, this framework is much more broadly applicable. We show how to apply the proposed framework to several other selection procedures including orthogonal matching pursuit and marginal screening+Lasso.